Further Theoretical Study of Distribution Separation Method for Information Retrieval
Abstract
Recently, a Distribution Separation Method (DSM) was proposed for relevance feedback in information retrieval. It aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture distribution. While DSM has achieved promising empirical performance, its theoretical analysis still needs further study, as does its comparison with other related retrieval models. In this article, we first generalize DSM's theoretical properties by proving that its minimum correlation assumption is equivalent to the maximum (original and symmetrized) KL-divergence assumption. Second, we analytically show that the EM algorithm in a well-known Mixture Model is essentially a distribution separation process and can be simplified using the linear separation algorithm in DSM. Some empirical results are also presented to support our theoretical analysis.

Introduction

Relevance feedback is an effective method in information retrieval that can significantly improve retrieval performance. However, the approximation of the relevance model is usually still a mixture model containing an irrelevant component. A Distribution Separation Method was recently proposed to solve this problem [1].

The formulation of the basic DSM was based on two assumptions, namely the linear combination assumption and the minimum correlation assumption. The former assumes that the mixture term distribution is a linear combination of the relevance and irrelevance distributions, while the latter assumes that the relevance distribution should have the minimum correlation with the irrelevance distribution. The basic DSM provided a lower-bound analysis for the linear combination coefficient, based on which the desired relevance distribution can be estimated. It was also proved that the lower bound of the linear combination coefficient corresponds to the minimum Pearson correlation coefficient between DSM's output relevance distribution and the input seed irrelevance distribution.
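The two assumptions above can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names, the toy distributions, and the explicit formulas are assumptions for illustration. The sketch assumes the mixture is `lam * R + (1 - lam) * I_S`, so that separation is `M / lam + (1 - 1 / lam) * I_S`, and it takes the lower bound on `lam` to be the smallest value that keeps every separated probability non-negative.

```python
import numpy as np


def lambda_lower_bound(mixture, seed_irrelevance):
    """Smallest coefficient that keeps the separated distribution non-negative.

    Assumption: non-negativity of M(i)/lam + (1 - 1/lam) * I_S(i) requires
    lam >= 1 - M(i) / I_S(i) for every term i with I_S(i) > 0.
    """
    mask = seed_irrelevance > 0
    return float(np.max(1.0 - mixture[mask] / seed_irrelevance[mask]))


def separate(mixture, seed_irrelevance, lam):
    """Linear separation of the seed irrelevance distribution from the mixture."""
    return mixture / lam + (1.0 - 1.0 / lam) * seed_irrelevance


def pearson(x, y):
    """Pearson correlation coefficient between two term distributions."""
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical toy distributions over four terms.
relevance = np.array([0.5, 0.3, 0.2, 0.0])
seed_irrelevance = np.array([0.0, 0.1, 0.3, 0.6])
true_lambda = 0.6
mixture = true_lambda * relevance + (1.0 - true_lambda) * seed_irrelevance

lam_low = lambda_lower_bound(mixture, seed_irrelevance)
estimate = separate(mixture, seed_irrelevance, lam_low)

# Correlation with the seed irrelevance distribution for several coefficients.
coefficients = [lam_low, 0.8, 1.0]
correlations = [
    pearson(separate(mixture, seed_irrelevance, lam), seed_irrelevance)
    for lam in coefficients
]
print(lam_low, estimate, correlations)
```

In this toy case the lower bound recovers the true coefficient because the relevance distribution has a zero entry where the seed irrelevance distribution is positive. The correlation with the seed irrelevance distribution is smallest at the lower bound and grows as the coefficient increases, which agrees with the minimum correlation property described above.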
In this article, we theoretically extend the generality of the aforementioned linear combination analysis and minimum correlation analysis of DSM. First, we explore the effect of DSM on the KL-divergence between DSM's output distribution and the seed irrelevance distribution. We theoretically prove that the lower-bound analysis can also be applied to KL-divergence, and that the minimum correlation coefficient corresponds to the maximum KL-divergence. We further prove that decreasing the correlation coefficient also leads to the maximum symmetrized KL-divergence as well as the maximum JS-divergence. Second, we investigate the relations between DSM and a well-known Mixture Model Feedback (MMF) approach [2] in information retrieval. We theoretically show that the EM-based iterative algorithm in MMF is essentially a distribution separation process, and thus its iterative steps can be simplified by the linear separation technique developed in DSM without a decline in performance.

Basic Analysis of DSM

In this section, we briefly describe the assumptions and analysis of the basic DSM [1]. We use M to represent the mixture term distribution derived from all the feedback documents, and we assume that M is a mixture of the relevance term distribution R and the irrelevance term distribution I. In addition, we assume that only part of the irrelevance distribution, $I_S$ (also called the seed irrelevance distribution), is available, while the other part of the irrelevance distribution is unknown (denoted as $I_{\bar{S}}$).

The task of DSM can be defined as follows: given the mixture distribution M and a seed irrelevance distribution $I_S$, derive an output distribution that approximates R as closely as possible. Specifically, as shown in Figure 1, the task of DSM can be divided into two problems:

(1) How to separate $I_S$ from M and derive $l(R, I_{\bar{S}})$, which is less noisy but is still a mixture of the true relevance distribution (R) and the unknown irrelevance distribution ($I_{\bar{S}}$)?
(2) How to further refine the derived distribution $l(R, I_{\bar{S}})$ to approximate R as closely as possible?

To solve the above two problems, DSM assumes that a ...

Notation           Description
M                  Mixture term distribution
R                  Relevance term distribution
I                  Irrelevance term distribution
$I_S$              Seed irrelevance distribution
$I_{\bar{S}}$      Unknown irrelevance distribution
F(i)               Probability of the ith term in any distribution F
l(F, G)            Linear combination of distributions F and G
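The claimed link between the separation coefficient and the divergence from the seed irrelevance distribution can be checked numerically. The sketch below is illustrative only: the toy distributions, the function names, and the separation formula `M / lam + (1 - 1 / lam) * I_S` are assumptions, and the symmetrized KL-divergence is taken here as the plain sum of the two directed divergences (other conventions halve it).

```python
import numpy as np


def separate(mixture, seed_irrelevance, lam):
    """Assumed linear separation: M / lam + (1 - 1 / lam) * I_S."""
    return mixture / lam + (1.0 - 1.0 / lam) * seed_irrelevance


def kl_divergence(p, q):
    """KL(p || q), using the convention 0 * log(0 / q) = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def symmetrized_kl(p, q):
    """Sum of the two directed KL-divergences (one common convention)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights."""
    midpoint = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical, strictly positive toy distributions over four terms.
relevance = np.array([0.4, 0.3, 0.2, 0.1])
seed_irrelevance = np.array([0.1, 0.2, 0.3, 0.4])
mixture = 0.5 * relevance + 0.5 * seed_irrelevance

# Coefficients in decreasing order, all above the non-negativity bound (0.375 here).
coefficients = [1.0, 0.8, 0.6, 0.4]
outputs = [separate(mixture, seed_irrelevance, lam) for lam in coefficients]

kl_values = [kl_divergence(out, seed_irrelevance) for out in outputs]
sym_values = [symmetrized_kl(out, seed_irrelevance) for out in outputs]
js_values = [js_divergence(out, seed_irrelevance) for out in outputs]

for lam, kl, sym, js in zip(coefficients, kl_values, sym_values, js_values):
    print(f"lam={lam:.1f}  KL={kl:.4f}  symKL={sym:.4f}  JS={js:.4f}")
```

On this example all three divergences grow as the coefficient decreases toward its lower bound, which is the behaviour the analysis predicts. The coefficients are kept strictly above the bound so that the output distribution stays positive and the reverse KL-divergence remains finite.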
Journal: CoRR
Volume: abs/1510.03299
Publication date: 2015